What We'll Cover
Last session, we dissected the transformer architecture—the structure of modern LLMs. But how do billions of parameters actually learn to generate coherent, knowledgeable text? This session covers the training process: how we go from random weights and massive text corpora to models that can write, reason, and assist with research.
We'll explore the pre-training objective, the scale of data required, the optimization techniques that make training feasible, and the staggering computational costs involved. We'll also examine scaling laws that predict model performance and the parallelism strategies that enable training at unprecedented scales.
Key question: Why does training a frontier model cost tens of millions of dollars—and what are researchers buying with that money?
🎯 Pre-Training Objectives
Before we can train a model, we need a learning objective: what task should the model solve during training? Modern LLMs primarily use one deceptively simple objective.
🔑 Next-Token Prediction
The dominant pre-training objective for modern decoder-only LLMs: given a sequence of tokens, predict the next token.
In many training pipelines this is the main loss, and it’s powerful: applied to massive corpora, it encourages the model to capture grammar, factual associations, and patterns of reasoning. Some training setups also add auxiliary losses or curriculum/mixing strategies (especially for multimodal or retrieval-augmented systems).
Autoregressive Language Modeling
The dominant approach for decoder-only models (GPT, Claude, LLaMA):
- Input: Sequence of tokens [t₁, t₂, ..., tₙ]
- Task: Predict tₙ₊₁ given [t₁, ..., tₙ]
- Training signal: Cross-entropy loss between predicted distribution and actual next token
- Causal masking: Model can only see previous tokens, not future ones
- Example: "The capital of France is" → model should assign high probability to "Paris"
Why This Works
Next-token prediction seems simple, but it pushes the model to represent:
- Syntax: Grammatical structure, word order, punctuation
- Semantics: Context-dependent meaning and reference
- Regularities: Statistical structure of language, code, and math-like text
- Factual patterns: Associations present in the corpus (not guaranteed truth)
- Inference patterns: Common reasoning templates that help predict what comes next
💡 Self-Supervised Learning
No human labels needed—the text itself provides supervision. This is why LLMs can train on very large corpora without manual annotation.
Alternative Objectives (Historical and Still Useful)
Other pre-training approaches:
- Masked Language Modeling (BERT): Mask random tokens, predict them from context (bidirectional)
- Encoder-Decoder (T5): Span corruption/reconstruction tasks
- Prefix LM: Hybrid approach with bidirectional prefix and autoregressive suffix
Why decoder-only is common: Simple, scales well, and is directly suited to open-ended generation. That said, encoder–decoder and masked objectives remain competitive for some tasks (e.g., translation, certain structured generation settings).
📹 A little more on next-token prediction
📚 Training Data: Scale, Quality, and Curation
The data you train on determines what your model can pick up. Modern LLMs are trained on enormous token counts from diverse sources—but data quality and composition matter as much as quantity.
📊 Data Scale Evolution
Training data has grown dramatically:
| Model | Year | Training Tokens | Data Sources |
|---|---|---|---|
| GPT-2 | 2019 | ~8 billion | WebText (outbound Reddit links) |
| GPT-3 | 2020 | ~300 billion | Filtered Common Crawl, WebText2, Books, Wikipedia (mixture described in the paper) |
| GPT-4 | 2023 | ~13 trillion (unreported, speculative) | Undisclosed; multimodal data included in GPT-4V-style systems |
| LLaMA 2 | 2023 | 2 trillion | Publicly available data; no Meta user data (per authors) |
| LLaMA 3 | 2024 | ~15 trillion | More diverse sources; stronger filtering and quality controls (per authors) |
Trend (often claimed): As public web text becomes noisier and more saturated, labs increasingly emphasize filtering, licensing, domain balancing, and synthetic data. The extent of “data scarcity” depends heavily on definitions and access to private/licensed corpora.
Common Data Sources
- Web crawls: Common Crawl, C4 (Colossal Clean Crawled Corpus)
- Books: Project Gutenberg, other book corpora (often legally and ethically contentious)
- Wikipedia: High-quality encyclopedic knowledge
- Code repositories: GitHub, Stack Overflow (for coding + structured text patterns)
- Scientific papers: arXiv, PubMed, and academic corpora
- Forums/conversations: Reddit and other forums (with heavy filtering)
Data Curation Challenges
- Quality filtering: Removing spam, boilerplate, and low-information text
- Deduplication: Reducing repeats that inflate token counts without adding signal
- Toxicity: Filtering harmful or unsafe content (imperfectly)
- Privacy: Removing personal/sensitive information
- Copyright: Legal questions about using copyrighted material
- Contamination: Preventing evaluation/test sets from leaking into training data
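To make the deduplication step above concrete, here is a minimal exact-duplicate filter using content hashes. This is a toy sketch: real pipelines also use near-duplicate detection (e.g., MinHash), and the normalization here is deliberately crude:

```python
import hashlib

def normalize(text):
    """Crude normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())

def deduplicate(documents):
    """Drop exact duplicates after normalization; keep first occurrence."""
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["Hello world.", "hello   WORLD.", "A different document."]
print(len(deduplicate(docs)))  # the first two normalize identically
```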
Data Composition Matters
It's not just volume—the mix of data types affects model behavior:
- Code data: Often correlates with better performance on structured reasoning tasks (causal mechanism debated)
- Mathematical text: Can improve symbolic/quantitative patterns (often still brittle)
- Multilingual data: Enables cross-lingual transfer and broader coverage
- Instruction-like data: Some pipelines include instruction-following examples earlier than “alignment” (varies by lab)
- Balance: Over-representing one domain biases style and knowledge
🔧 Tokenization: Preparing Text for Training
Before training, text must be converted to tokens:
- BPE (Byte-Pair Encoding): Common for GPT/LLaMA-family tokenizers. Iteratively merges frequent symbol pairs.
- WordPiece: Used by BERT-style models; similar spirit to BPE with a different objective.
- SentencePiece: A tokenizer toolkit that can implement BPE or Unigram LM; trains from raw text without pre-tokenization, often with byte fallback options.
- Vocabulary size: Often 32K–100K tokens. Larger vocab can reduce sequence length but increases embedding/softmax sizes.
Trade-off: Finer tokens = more flexible, longer sequences. Coarser tokens = shorter sequences, less flexibility for rare words/morphology.
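The BPE merge loop described above fits in a few lines. This is an illustrative sketch of the core idea only (real tokenizers like those for GPT/LLaMA add byte-level handling, pre-tokenization rules, and learned merge tables):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent symbol pairs across the corpus; return the most common."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from characters; each merge step grows the vocabulary by one symbol
tokens = list("low lower lowest")
for _ in range(3):
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)
```

After three merges the frequent substring "low" has become a single token, which is exactly the compression behavior the vocabulary-size trade-off below is about.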
📄 Reading Resources: Data Curation, Scaling Laws, and Data Mixtures
🧠 Core Papers
- Chinchilla scaling laws / compute-optimal training
Training Compute-Optimal Large Language Models (Hoffmann et al., 2022).
Key idea: for a fixed compute budget, many models are under-trained on too few tokens; compute-optimal regimes often favor training on more tokens per parameter.
arXiv:2203.15556
- Earlier scaling laws
Scaling Laws for Neural Language Models (Kaplan et al., 2020).
Establishes empirical scaling relationships and provides a baseline lens for thinking about compute/data/model size regimes (pre-Chinchilla).
arXiv:2001.08361
- GPT-3 data mixture
Language Models are Few-Shot Learners (Brown et al., 2020).
Describes the training data mixture (filtered Common Crawl + WebText2 + Books + Wikipedia) and why mixture design matters.
arXiv:2005.14165
- LLaMA 2 data + filtering
Llama 2: Open Foundation and Fine-Tuned Chat Models (Touvron et al., 2023).
Data sources, filtering, and high-level curation choices from a major open model.
arXiv:2307.09288
- GPT-4 Technical Report
Not detailed on datasets, but useful for seeing what major labs disclose and how they frame data/safety constraints at a high level.
arXiv:2303.08774
🎓 Practical Guides / Hands-On References
- Hugging Face Course (data prep + tokenization basics)
Tokenization, dataset preparation, and evaluation hygiene.
huggingface.co/learn
- Instruction-tuning dataset construction (for curation intuition)
Alpaca-style pipelines illustrate balancing, formatting, and contamination pitfalls (more finetuning than pretraining).
Stanford Alpaca (GitHub)
📹 LLM Tokenizers explained
⚙️ Optimization: Making Training Work
Training a billion-parameter model requires sophisticated optimization techniques. You can't just run vanilla gradient descent and expect it to work!
🎯 The Optimization Challenge
Training LLMs means finding good values for billions of parameters in a high-dimensional space. The loss landscape is non-convex, and training is computationally expensive and sensitive to hyperparameters.
Modern LLM training relies on: adaptive optimizers (Adam variants), careful learning rate schedules, large global batch sizes, and gradient clipping to reduce instability.
Optimizers
Algorithms for updating parameters based on gradients:
- SGD: Foundational but often slower/harder to tune at LLM scale.
- Adam: Adaptive learning rates per parameter; common baseline.
- AdamW: Adam with decoupled weight decay; a standard choice for LLM training.
- Adafactor: Memory-efficient variant used in some very large models.
- Lion: A recent optimizer sometimes explored for efficiency; usage varies and is not universal.
💡 Why Adam-style optimizers?
They maintain momentum and adaptive step sizes, which often stabilizes training across very heterogeneous parameter scales.
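The AdamW update can be written out for a single scalar parameter. This is a didactic sketch with illustrative hyperparameters, not a production optimizer (real implementations vectorize over all parameters):

```python
def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (adaptive scale)
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: applied to the parameter directly, not the gradient
    param = param - lr * (m_hat / (v_hat ** 0.5 + eps) + weight_decay * param)
    return param, m, v

p, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    p, m, v = adamw_step(p, grad=0.5, m=m, v=v, t=t)
print(p)
```

Note how the effective step size is roughly `lr` regardless of the raw gradient magnitude, because the gradient is divided by its own running scale.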
Learning Rate Schedules
Learning rate determines step size during optimization. Too high = instability; too low = slow convergence.
- Warmup: Start small and ramp up (reduces early instability)
- Cosine decay: Common after warmup
- Linear decay: Another standard schedule
- Typical peak LR: often around 1e-4 to 3e-4 in many public recipes (depends strongly on batch size/model size)
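The warmup-then-cosine shape described above is easy to write down. All the constants here are illustrative defaults (every real recipe picks its own warmup length, peak, and floor):

```python
import math

def lr_schedule(step, warmup_steps=2000, total_steps=100_000,
                peak_lr=3e-4, min_lr=3e-5):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_schedule(1000), lr_schedule(2000), lr_schedule(100_000))
```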
Batch Size & Gradient Accumulation
- Batch size: Number of sequences/tokens processed before updating parameters
- Large global batches: In many large-scale runs the global batch corresponds to millions of tokens, but it varies by model and hardware
- Why large? Stabilizes training and improves hardware utilization
- Gradient accumulation: If you can't fit the batch, accumulate gradients over multiple forward passes
- Trade-off: Extremely large batches can alter generalization dynamics; effects depend on regime
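Gradient accumulation is just averaging gradients over several small forward/backward passes before a single optimizer step. This toy sketch uses scalar "gradients" to show that the result matches the full-batch gradient (the gradient function here is hypothetical):

```python
def train_step_with_accumulation(microbatches, accumulation_steps, compute_grad):
    """Average gradients over several microbatches, then take one optimizer step.

    Simulates a larger batch than fits in memory at once.
    """
    accumulated = 0.0
    for micro in microbatches[:accumulation_steps]:
        accumulated += compute_grad(micro) / accumulation_steps
    return accumulated  # hand this single averaged gradient to the optimizer

# Toy gradient function: the gradient equals the batch mean
grads = train_step_with_accumulation(
    microbatches=[[1.0, 2.0], [3.0, 4.0]],
    accumulation_steps=2,
    compute_grad=lambda batch: sum(batch) / len(batch),
)
print(grads)  # equals the mean over the full batch [1, 2, 3, 4]
```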
🛡️ Preventing Training Instability
Training can diverge due to exploding gradients or numerical issues. Common mitigations:
- Gradient clipping: Cap gradient norm (often around 1.0, but not universal)
- Mixed precision: Use lower precision for speed, with higher precision accumulation to reduce under/overflow
- Normalization: LayerNorm/RMSNorm stabilizes activations
- Warmup: Gradual LR ramp prevents early divergence
- Checkpointing: Reload from recent stable checkpoint after spikes/failures
Reality check: Large-scale training can still fail; robust checkpointing and monitoring are essential.
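Gradient clipping, the first mitigation above, works by rescaling all gradients when their combined norm is too large. A minimal sketch on a flat list of scalars (real frameworks operate on parameter tensors):

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradients down if their combined L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)  # norm 5.0 -> rescaled to 1.0
print(clipped)
```

Because the whole gradient vector is scaled by one factor, the update direction is preserved; only the step size shrinks.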
📹 ChatGPT 5.2 explanation of the Adam Optimizer and Learning Rate Scheduling
Here you will find the explanation of the material above that I generated with ChatGPT 5.2. I use a very specific preprompt; try the same prompt in different AIs and see how well, or badly, each one does.
💰 Computational Resources & Costs
Training frontier LLMs requires staggering amounts of compute. Let's quantify exactly what that means.
🔢 Understanding FLOPs
FLOP = Floating Point Operation (a single addition or multiplication)
A common dense-transformer back-of-the-envelope estimate:
- Training compute: ≈ 6 FLOPs per parameter per token
- Forward pass: ≈ 2 FLOPs/param/token
- Backward pass: ≈ 4 FLOPs/param/token
Caveat: Real compute differs with architecture and implementation (attention vs MLP ratios, sequence length effects, activation checkpointing, MoE routing, etc.). Use this as an order-of-magnitude estimate.
Example: Training a 7B parameter model on 1T tokens ≈ 6 × 7B × 1T = 42 × 10²¹ FLOPs = 42 zettaFLOPs
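The example above as a two-line calculator (remember this is an order-of-magnitude estimate, per the caveat):

```python
def training_flops(params, tokens, flops_per_param_token=6):
    """Back-of-the-envelope training compute for a dense transformer."""
    return flops_per_param_token * params * tokens

flops = training_flops(params=7e9, tokens=1e12)
print(f"{flops:.1e} FLOPs")  # ~4.2e+22, i.e. 42 zettaFLOPs
```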
💸 Cost Estimates for Famous Models
Compute requirements and estimated costs (highly approximate):
| Model | Parameters | Training Tokens | Compute (FLOPs) | GPU-Days (A100) | Estimated Cost |
|---|---|---|---|---|---|
| GPT-3 175B | 175B | 300B | ~3.1 × 10²³ | ~11,500 | ~$5–10M |
| LLaMA 2 70B | 70B | 2T | ~8.4 × 10²³ | ~31,000 | ~$15–20M |
| GPT-4 (speculative dense equivalent) | ~1.76T (unconfirmed) | ~13T (unconfirmed) | ~1.4 × 10²⁶ | ~5,000,000 | ~$100M+ |
| Claude Opus 4.5 (speculative dense equivalent) | ~500B (unconfirmed) | ~10T (unconfirmed) | ~3 × 10²⁵ | ~1,100,000 | ~$50–70M |
Note: These are rough estimates. Modern frontier models may use Mixture-of-Experts (MoE), where “parameters” and “active compute per token” diverge. Costs also depend on utilization, restarts, networking overheads, and whether compute is rented or owned.
Hardware Requirements
- GPUs: Thousands of A100/H100/H200 GPUs running in parallel
- Interconnect: High-bandwidth NVLink / InfiniBand for GPU communication
- Storage: Large, fast storage for streaming training data
- Clusters: Datacenters with substantial power and cooling constraints
- Cost per GPU-hour (cloud): Varies widely by provider/region/commitment; treat any single number as a moving target
Training Time
How long does it take?
- Small models (7B): Days to weeks on moderate clusters
- Mid-size (70B): Weeks to months on large clusters
- Frontier (1T+): Months on massive clusters (10K+ GPUs)
- Bottleneck: Often GPU communication, not raw compute
- Restarts: Training runs fail; checkpointing every few hours is critical
⚡ Energy Consumption
Energy use depends on hardware, datacenter efficiency (PUE), utilization, and retries. Published numbers vary substantially across sources and assumptions.
Rule of thumb for impact discussions: inference can dominate total energy/carbon footprint for widely deployed models, but this depends on deployment scale and usage patterns.
📈 Scaling Laws: Predicting Performance
Can we predict how good a model will be before spending millions on training? Scaling laws say yes—with important caveats.
🔮 The Scaling Law Hypothesis
Model performance (often measured by loss on held-out data) follows predictable power laws as a function of:
- Model size (N): Number of parameters
- Dataset size (D): Number of training tokens
- Compute (C): FLOPs spent on training
These relationships let labs extrapolate performance and plan expensive training runs.
Kaplan Scaling Laws (2020)
OpenAI's early scaling-law analysis suggested:
- Smooth power laws: Test loss decreases predictably with scale
- Weak sensitivity to shape: Depth vs width mattered less than total scale in their regime
- Overfitting was limited: With large datasets, larger models continued improving
- Takeaway (in that regime): Scaling model size was an effective lever
Chinchilla Scaling Laws (2022)
DeepMind revised the story: data matters more than previously assumed.
- Compute-optimal training: Model size and tokens should scale together
- Rule of thumb: ~20 tokens per parameter for compute-optimal dense training (in their setting)
- GPT-3 undertrained (by that criterion): 175B params on 300B tokens; compute-optimal would use more tokens
- Chinchilla: 70B params on ~1.4T tokens outperformed some larger-but-undertrained models
💡 The Chinchilla Insight
If you have a fixed compute budget, it can be better to train a smaller model on more data than a huge model on too little data.
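The Chinchilla rule of thumb can be turned into a budget split. Combining the compute estimate C ≈ 6·N·D with D ≈ 20·N gives C ≈ 120·N², so N ≈ √(C/120). This sketch applies that (the rule of thumb comes from one training setup; it is a planning heuristic, not a law):

```python
import math

def chinchilla_allocation(compute_flops, tokens_per_param=20):
    """Split a compute budget between parameters and tokens.

    Uses C = 6 * N * D together with D = tokens_per_param * N.
    """
    params = math.sqrt(compute_flops / (6 * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens

# A budget in the rough vicinity of Chinchilla's own training run
n, d = chinchilla_allocation(5.8e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```

For this budget the function returns roughly 70B parameters and 1.4T tokens, consistent with the Chinchilla figures quoted above.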
Implications for Research
- Predictability: Early loss curves can often forecast final loss
- Allocation choices: Labs trade off model size vs tokens based on product constraints
- Algorithmic progress: Architecture/optimizer improvements matter, but scale remains a dominant driver
- Sharp transitions: Some benchmarks show abrupt-looking capability jumps; whether these are intrinsic or evaluation-dependent is debated
📊 Training for Compute vs Training for Inference Cost
“Compute-optimal” usually means best loss for a fixed training compute budget. Product teams may instead optimize for low inference cost at a target quality, which can justify extra training (more tokens) to make a smaller model good enough.
| Consideration | Compute-Optimal (fixed training compute) | Inference-Targeted (fixed target quality) |
|---|---|---|
| Model size | Often smaller (paired with more tokens) | Often smaller (to reduce inference cost), but trained longer |
| Training tokens | Scaled with parameters (e.g., ~20 tokens/param in Chinchilla regime) | May exceed compute-optimal tokens to reach a target quality with a smaller model |
| Training cost | Optimized for given budget | Potentially higher (extra training to reduce deployment costs) |
| Inference cost | Not the primary objective; depends on the resulting model size | Explicitly minimized (smaller model for a target quality) |
| Best for | Research planning; compute budgeting; fast iteration | Deployed systems where serving cost dominates |
Example framing: A lab might spend extra training compute to make a smaller model reach the desired quality, because the smaller model is dramatically cheaper to serve at scale.
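A toy break-even calculation makes this framing concrete. Every number here is hypothetical, chosen only to show the shape of the trade-off:

```python
def break_even_queries(extra_training_cost, cost_per_query_large,
                       cost_per_query_small):
    """Queries served before extra training spend pays for itself."""
    savings_per_query = cost_per_query_large - cost_per_query_small
    return extra_training_cost / savings_per_query

# Hypothetical: $2M of extra training lets a smaller model match quality,
# saving $0.001 on every served query
q = break_even_queries(2_000_000, 0.003, 0.002)
print(f"{q:.0f} queries")  # roughly 2 billion queries to break even
```

At high deployment volumes, break-even arrives quickly, which is why serving-cost-driven labs over-train small models.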
🔀 Parallelism: Training at Scale
A single GPU can't hold a 70B parameter model, let alone train it. Modern LLM training requires distributing the model and data across many GPUs using parallelism strategies.
Data Parallelism
The simplest approach: replicate the model across GPUs and split data across them.
- Setup: Each GPU holds a complete model replica
- Process: Different GPUs process different batches
- Synchronization: Aggregate gradients across GPUs
- Pros: Simple, good utilization
- Cons: Memory-limited—very large models won’t fit on one GPU
- Used for: Smaller models or when combined with sharding
Model Parallelism (Tensor Parallel)
Split individual layers across GPUs.
- Setup: Each transformer layer split across multiple GPUs
- Example: Attention/MLP matrices split across GPUs
- Process: Frequent intra-layer communication
- Pros: Can scale to very large models
- Cons: Communication overhead
- Used for: Models too large for a single GPU
Pipeline Parallelism
Split model into stages—each GPU handles specific layers.
- Setup: Layers partitioned into pipeline stages
- Microbatching: Keeps all stages busy
- Pros: Less intra-layer comms than tensor parallel
- Cons: Pipeline “bubbles” (idle time), scheduling complexity
- Used for: Very large models (often with tensor parallel)
🎯 Combining Parallelism: 3D Parallelism
Frontier training often combines multiple strategies:
Example: Training a 175B model on 1024 GPUs
- Data parallelism: 16-way
- Pipeline parallelism: 8-way
- Tensor parallelism: 8-way
- Total: 16 × 8 × 8 = 1024 GPUs
Challenge: Tuning these dimensions is nontrivial: too much tensor parallelism can bottleneck on communication; too much pipeline parallelism increases bubble overhead.
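The example above can be sketched as a mapping from a flat GPU rank to its position in the 3D grid. The specific axis ordering below is one common layout among several used in practice:

```python
def rank_to_coords(rank, dp=16, pp=8, tp=8):
    """Map a flat GPU rank to (data, pipeline, tensor) parallel coordinates."""
    tp_rank = rank % tp                 # tensor-parallel peers are adjacent ranks
    pp_rank = (rank // tp) % pp         # then pipeline stages
    dp_rank = rank // (tp * pp)         # then data-parallel replicas
    return dp_rank, pp_rank, tp_rank

total = 16 * 8 * 8                      # 1024 GPUs, as in the example above
coords = {rank_to_coords(r) for r in range(total)}
print(len(coords))  # every GPU gets a unique (dp, pp, tp) coordinate
```

Placing tensor-parallel peers on adjacent ranks is deliberate: they communicate most frequently, so they should sit on the fastest interconnect (e.g., NVLink within a node).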
🛠️ Frameworks & Tools
Libraries that implement these parallelism strategies:
- DeepSpeed (Microsoft): ZeRO optimizations and multi-parallel strategies
- Megatron-LM (NVIDIA): Efficient tensor and pipeline parallelism for transformers
- FSDP (PyTorch): Fully Sharded Data Parallel (sharding + data parallel concepts)
- JAX/Flax: Flexible parallelism used in some large-scale research systems
📹 GPU explainer
📹 JAX explainer
📚 Summary & Key Takeaways
You now understand how LLMs are trained at scale:
- Pre-training objective: Next-token prediction is the dominant driver for decoder-only LMs (sometimes with additional training tricks)
- Training data: Massive corpora from curated web, books, code, papers—quality and composition matter
- Optimization: AdamW + LR schedules + large global batches + stability techniques
- Compute costs: Large-scale training requires enormous compute; exact costs depend on hardware and efficiency
- Scaling laws: Loss often scales predictably with compute; Chinchilla emphasizes balancing model size and tokens
- Parallelism: Data + tensor + pipeline parallelism distributes training across many GPUs
Next session (Week 2.3): We'll explore what happens after pre-training—fine-tuning, RLHF, and alignment techniques that turn raw language models into helpful AI assistants.